{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Dataset Tutorial\n",
    "\n",
    "Let's first load several packages from DeepPurpose"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# if you are using source version, uncomment the next two lines:\n",
    "#import os\n",
    "#os.chdir('../')\n",
    "\n",
    "from DeepPurpose import utils, DTI, dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are mainly three types of input data for DeepPurpose.\n",
    "1. Target Sequence and its name to be repurposed.\n",
    "2. Drug repurposing library.\n",
    "3. Training drug-target pairs, along with the binding scores.\n",
    "\n",
    "There are two ways to load the data.\n",
    "\n",
    "The first is to use the DeepPurpose.dataset library loader, which is very simple and we preprocess the data for you. The list of dataset supported is listed here:\n",
    "https://github.com/kexinhuang12345/DeepPurpose/blob/master/README.md#data\n",
    "\n",
    "The second way is to read from local files, which should follow our data format, as we illustrated below. \n",
    "\n",
    "Here are some examples. First, let's show how to load some target sequences for COVID19."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The target is: SGFKKLVSPSSAVEKCIVSVSYRGNNLNGLWLGDSIYCPRHVLGKFSGDQWGDVLNLANNHEFEVVTQNGVTLNVVSRRLKGAVLILQTAVANAETPKYKFVKANCGDSFTIACSYGGTVIGLYPVTMRSNGTIRASFLAGACGSVGFNIEKGVVNFFYMHHLELPNALHTGTDLMGEFYGGYVDEEVAQRVPPDNLVTNNIVAWLYAAIISVKESSFSQPKWLESTTVSIEDYNRWASDNGFTPFSTSTAITKLSAITGVDVCKLLRTIMVKSAQWGSDPILGQYNFEDELTPESVFNQVGGVRLQ\n",
      "The target name is: SARS-CoV 3CL Protease\n"
     ]
    }
   ],
   "source": [
    "target, target_name = dataset.load_SARS_CoV_Protease_3CL()\n",
    "print('The target is: ' + target)\n",
    "print('The target name is: ' + target_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The target is: SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQAGNVQLRVIGHSMQNCVLKLKVDTANPKTPKYKFVRIQPGQTFSVLACYNGSPSGVYQCAMRPNFTIKGSFLNGSCGSVGFNIDYDCVSFCYMHHMELPTGVHAGTDLEGNFYGPFVDRQTAQAAGTDTTITVNVLAWLYAAVINGDRWFLNRFTTTLNDFNLVAMKYNYEPLTQDHVDILGPLSAQTGIAVLDMCASLKELLQNGMNGRTILGSALLEDEFTPFDVVRQCSGVTFQ\n",
      "The target name is: SARS-CoV2 3CL Protease\n"
     ]
    }
   ],
   "source": [
    "target, target_name = dataset.load_SARS_CoV2_Protease_3CL()\n",
    "print('The target is: ' + target)\n",
    "print('The target name is: ' + target_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also support to read from local txt files. For target sequence, we assume it has one line, and the first is the target name, and space, and followed by targe amino acid sequence.\n",
    "\n",
    "RNA_polymerase_SARS_CoV2_target_seq.txt:\n",
    "\n",
    "RNA_polymerase_SARS_CoV2 SADAQS...PHTVLQ "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The target is: SADAQSFLNRVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKVAGFAKFLKTNCCRFQEKDEDDNLIDSYFVVKRHTFSNYQHEETIYNLLKDCPAVAKHDFFKFRIDGDMVPHISRQRLTKYTMADLVYALRHFDEGNCDTLKEILVTYNCCDDDYFNKKDWYDFVENPDILRVYANLGERVRQALLKTVQFCDAMRNAGIVGVLTLDNQDLNGNWYDFGDFIQTTPGSGVPVVDSYYSLLMPILTLTRALTAESHVDTDLTKPYIKWDLLKYDFTEERLKLFDRYFKYWDQTYHPNCVNCLDDRCILHCANFNVLFSTVFPPTSFGPLVRKIFVDGVPFVVSTGYHFRELGVVHNQDVNLHSSRLSFKELLVYAADPAMHAASGNLLLDKRTTCFSVAALTNNVAFQTVKPGNFNKDFYDFAVSKGFFKEGSSVELKHFFFAQDGNAAISDYDYYRYNLPTMCDIRQLLFVVEVVDKYFDCYDGGCINANQVIVNNLDKSAGFPFNKWGKARLYYDSMSYEDQDALFAYTKRNVIPTITQMNLKYAISAKNRARTVAGVSICSTMTNRQFHQKLLKSIAATRGATVVIGTSKFYGGWHNMLKTVYSDVENPHLMGWDYPKCDRAMPNMLRIMASLVLARKHTTCCSLSHRFYRLANECAQVLSEMVMCGGSLYVKPGGTSSGDATTAYANSVFNICQAVTANVNALLSTDGNKIADKYVRNLQHRLYECLYRNRDVDTDFVNEFYAYLRKHFSMMILSDDAVVCFNSTYASQGLVASIKNFKSVLYYQNNVFMSEAKCWTETDLTKGPHEFCSQHTMLVKQGDDYVYLPYPDPSRILGAGCFVDDIVKTDGTLMIERFVSLAIDAYPLTKHPNQEYADVFHLYLQYIRKLHDELTGHMLDMYSVMLTNDNTSRYWEPEFYEAMYTPHTVLQ\n",
      "The target name is: RNA_polymerase_SARS_CoV2\n"
     ]
    }
   ],
   "source": [
    "target, target_name = dataset.read_file_target_sequence('./toy_data/RNA_polymerase_SARS_CoV2_target_seq.txt')\n",
    "print('The target is: ' + target)\n",
    "print('The target name is: ' + target_name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's move on to drug repurposing library. We currently support an antiviral drugs library and the broad repurposing library. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_repurpose, Drug_Names, Drug_CIDs = dataset.load_antiviral_drugs()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['C1CC1NC2=C3C(=NC(=N2)N)N(C=N3)C4CC(C=C4)CO',\n",
       "       'C1=NC2=C(N1COCCO)NC(=NC2=O)N',\n",
       "       'C1=NC(=C2C(=N1)N(C=N2)CCOCP(=O)(O)O)N'], dtype=object)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_repurpose[:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['Abacavir', 'Aciclovir', 'Adefovir'], dtype=object)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Drug_Names[:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([441300,   2022,  60172])"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Drug_CIDs[:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the above example, the data is downloaded from the cloud and saved into default folder *'./data'*, you can also specify your PATH by *dataset.load_antiviral_drugs(PATH)*. \n",
    "\n",
    "We also allow option to not output PubChem CID by setting *dataset.load_antiviral_drugs(no_cid = True)*, this allows less lines for one line mode DeepPurpose, since in one line mode, the function expects only X_repurpose and Drug_Names. \n",
    "\n",
    "Similarly for Broad Repurposing Hub, we can do the same:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_drug, Drug_Names, Drug_CIDs = dataset.load_broad_repurposing_hub()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['CO\\\\N=C(\\\\C(=O)NC1C2SCC(CSc3nnnn3C)=C(N2C1=O)C(O)=O)c1csc(N)n1',\n",
       "       'CN1CCN(CC1)c1c(F)cc2c3c1SCCn3cc(C(O)=O)c2=O',\n",
       "       'C[C@H]1CN(C[C@@H](C)N1)c1c(F)c(N)c2c(c1F)n(cc(C(O)=O)c2=O)C1CC1'],\n",
       "      dtype=object)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_drug[:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['7-[[(2E)-2-(2-Amino-1,3-thiazol-4-yl)-2-methoxyiminoacetyl]amino]-3-[(1-methyltetrazol-5-yl)sulfanylmethyl]-8-oxo-5-thia-1-azabicyclo[4.2.0]oct-2-ene-2-carboxylic acid',\n",
       "       'Rufloxacin', 'Sparfloxacin'], dtype=object)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Drug_Names[:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will first download the file from cloud to local default *'./data'* folder or you can input your data folder. \n",
    "\n",
    "Note that in the one line mode (*oneliner.repurpose()*), if you don't specify any *X_repurpose* library, the method will automatically use the Broad Repurposing Hub data and use the PubChem CIDs as the drug names since some drugs (as you can see from the above examples) are way too long.\n",
    "\n",
    "Now, let's show how you can load your own library using txt file!\n",
    "\n",
    "We assume the txt file consists of the following structure:\n",
    "\n",
    "repurposing_library.txt\n",
    "\n",
    "Rufloxacin CN1CCN(CC1)c1c(F)cc2c3c1SCCn3cc(C(O)=O)c2=O\\\n",
    "Sparfloxacin C[C@H]1CN(C[C@@H](C)N1)c1c(F)c(N)c2c(c1F)n(cc(C(O)=O)c2=O)C1CC1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_drug, Drug_Names = dataset.read_file_repurposing_library('./toy_data/repurposing_data_examples.txt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['CN1CCN(CC1)c1c(F)cc2c3c1SCCn3cc(C(O)=O)c2=O',\n",
       "       'C[C@H]1CN(CC@@HN1)c1c(F)c(N)c2c(c1F)n(cc(C(O)=O)c2=O)C1CC1'],\n",
       "      dtype='<U58')"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_drug"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['Rufloxacin', 'Sparfloxacin'], dtype='<U12')"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Drug_Names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Okay, let's now move to the final training dataset! There are in general two types of training dataset that we expect.\n",
    "\n",
    "1. The drug-target pairs with the binding score or the interaction 1/0 label.\n",
    "2. The bioassay data where there is only one target and many drugs are screened.\n",
    "\n",
    "For the first one, we provide three data loaders for public available drug-target interaction datasets: KIBA, DAVIS, and BindingDB. Let's first talk about DAVIS."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Beginning Processing...\n",
      "Beginning to extract zip file...\n",
      "Default set to logspace (nM -> p) for easier regression\n",
      "Done!\n"
     ]
    }
   ],
   "source": [
    "X_drugs, X_targets, y = dataset.load_process_DAVIS(path = './data', binary = False, convert_to_log = True, threshold = 30)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['CC1=C2C=C(C=CC2=NN1)C3=CC(=CN=C3)OCC(CC4=CC=CC=C4)N',\n",
       "       'CC1=C2C=C(C=CC2=NN1)C3=CC(=CN=C3)OCC(CC4=CC=CC=C4)N'],\n",
       "      dtype='<U92')"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_drugs[:2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQVTVDEVLAEGGFAIVFLVRTSNGMKCALKRMFVNNEHDLQVCKREIQIMRDLSGHKNIVGYIDSSINNVSSGDVWEVLILMDFCRGGQVVNLMNQRLQTGFTENEVLQIFCDTCEAVARLHQCKTPIIHRDLKVENILLHDRGHYVLCDFGSATNKFQNPQTEGVNAVEDEIKKYTTLSYRAPEMVNLYSGKIITTKADIWALGCLLYKLCYFTLPFGESQVAICDGNFTIPDNSRYSQDMHCLIRYMLEPDPDKRPDIYQVSYFSFKLLKKECPIPNVQNSPIPAKLPEPVKASEAAAKKTQPKARLTDPIPTTETSIAPRQRPKAGQTQPNPGILPIQPALTPRKRATVQPPPQAAGSSNQPGLLASVPQPKPQAPPSQPLPQTQAKQPQAPPTPQQTPSTQAQGLPAQAQATPQHQQQLFLKQQQQQQQPPPAQQQPAGTFYQQQQAQTQQFQAVHPATQKPAIAQFPVVSQGGSQQQLMQNFYQQQQQQQQQQQQQQLATALHQQQLMTQQAALQQKPTMAAGQQPQPQPAAAPQPAPAQEPAIQAPVRQQPKVQTTPPPAVQGQKVGSLTPPSSPKTQRAGHRRILSDVTHSAVFGVPASKSTQLLQAAAAEASLNKSKSATTTPSGSPRTSQQNVYNPSEGSTWNPFDDDNFSKLTAEELLNKDFAKLGEGKHPEKLGGSAESLIPGFQSTQGDAFATTSFSAGTAEKRKGGQTVDSGLPLLSVSDPFIPLQVPDAPEKLIEGLKSPDTSLLLPDLLPMTDPFGSTSDAVIEKADVAVESLIPGLEPPVPQRLPSQTESVTSNRTDSLTGEDSLLDCSLLSNPTTDLLEEFAPTAISAPVHKAAEDSNLISGFDVPEGSDKVAEDEFDPIPVLITKNPQGGHSRNSSGSSESSLPNLARSLLLVDQLIDL'],\n",
       "      dtype='<U2549')"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_targets[:1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([7.36552273, 4.99999566])"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y[:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "DAVIS dataloader has several default parameters. The path is the saving path. The binary parameter asks if you want to convert the binding score to binary classification since lots of the models are aimed to do that. The convert_to_log is to transform from the Kd unit from nM to p which has a more normal distribution for easier regression. The threshold is for binary classification, the default is recommended but you could also tune your own.\n",
    "\n",
    "Similarly, for KIBA."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Beginning Processing...\n",
      "Beginning to extract zip file...\n",
      "Done!\n"
     ]
    }
   ],
   "source": [
    "X_drugs, X_targets, y = dataset.load_process_KIBA(path = './data', binary = False, threshold = 9)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another large dataset we support is BindingDB. There are three different thing from the KIBA and DAVIS data loader:\n",
    "1. BindingDB is big (several GBs). So we provide a separate function to download BindingDB *download_BindingDB()*, which will return the downloaded file path for you. Then you can set the *path = download_BindingDB()* in the *process_BindingDB()* function.\n",
    "2. BindingDB has four Binding values for drug target pairs: IC50, EC50, Kd, Ki. You should set the 'y' to one of them for the drug target pairs you would like.\n",
    "3. The loading of BindingDB from local file to Pandas is also pretty slow. So instead of putting path into the function, you could also set the df = the bindingDB pandas dataframe object. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Beginning to download dataset...\n",
      "Beginning to extract zip file...\n",
      "Done!\n",
      "Loading Dataset from path...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "b'Skipping line 772572: expected 193 fields, saw 205\\nSkipping line 772598: expected 193 fields, saw 205\\n'\n",
      "b'Skipping line 805291: expected 193 fields, saw 205\\n'\n",
      "b'Skipping line 827961: expected 193 fields, saw 265\\n'\n",
      "b'Skipping line 1231688: expected 193 fields, saw 241\\n'\n",
      "b'Skipping line 1345591: expected 193 fields, saw 241\\nSkipping line 1345592: expected 193 fields, saw 241\\nSkipping line 1345593: expected 193 fields, saw 241\\nSkipping line 1345594: expected 193 fields, saw 241\\nSkipping line 1345595: expected 193 fields, saw 241\\nSkipping line 1345596: expected 193 fields, saw 241\\nSkipping line 1345597: expected 193 fields, saw 241\\nSkipping line 1345598: expected 193 fields, saw 241\\nSkipping line 1345599: expected 193 fields, saw 241\\n'\n",
      "b'Skipping line 1358864: expected 193 fields, saw 205\\n'\n",
      "b'Skipping line 1378087: expected 193 fields, saw 241\\nSkipping line 1378088: expected 193 fields, saw 241\\nSkipping line 1378089: expected 193 fields, saw 241\\nSkipping line 1378090: expected 193 fields, saw 241\\nSkipping line 1378091: expected 193 fields, saw 241\\nSkipping line 1378092: expected 193 fields, saw 241\\nSkipping line 1378093: expected 193 fields, saw 241\\nSkipping line 1378094: expected 193 fields, saw 241\\nSkipping line 1378095: expected 193 fields, saw 241\\n'\n",
      "b'Skipping line 1417264: expected 193 fields, saw 205\\n'\n",
      "/home/kh278/.conda/envs/DeepPurpose/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3254: DtypeWarning: Columns (8,9,10,11,12,13,15,17,19,20,26,27,31,32,34,35,46,49,50,51,52,53,54,61,62,63,64,65,66,73,74,75,76,77,78,85,86,87,88,89,90,97,98,99,100,101,102,109,110,111,112,113,114,121,122,123,124,125,126,133,134,135,136,137,138,145,147,148,149,150,157,158,159,160,161,162,169,171,172,173,174) have mixed types.Specify dtype option on import or set low_memory=False.\n",
      "  if (await self.run_code(code, result,  async_=asy)):\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Beginning Processing...\n",
      "There are 66444 drug target pairs.\n",
      "Default set to logspace (nM -> p) for easier regression\n"
     ]
    }
   ],
   "source": [
    "data_path = dataset.download_BindingDB('./data/')\n",
    "X_drugs, X_targets, y = dataset.process_BindingDB(path = data_path, df = None, y = 'Kd', binary = False, convert_to_log = True, threshold = 30)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are 66444 drug-target pairs.\n"
     ]
    }
   ],
   "source": [
    "print('There are ' + str(len(X_drugs)) + ' drug-target pairs.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's show how to load it from txt file. We assume it has the following format:\n",
    "\n",
    "dti.txt\n",
    "\n",
    "CC1=C...C4)N MKK...LIDL 7.365 \\\n",
    "CC1=C...C4)N QQP...EGKH 4.999"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_drugs, X_targets, y = dataset.read_file_training_dataset_drug_target_pairs('./toy_data/dti.txt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['CC1=C2C=C(C=CC2=NN1)C3=CC(=CN=C3)OCC(CC4=CC=CC=C4)N',\n",
       "       'CC1=C2C=C(C=CC2=NN1)C3=CC(=CN=C3)OCC(CC4=CC=CC=C4)N'],\n",
       "      dtype='<U51')"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_drugs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are almost here! Now, in the end, let's look at bioassay data. We only write the AID1706 bioassay loader for now. But please check the source code since it is easy to produce another one. \n",
    "\n",
    "There are several things to look at.\n",
    "\n",
    "1. we have a new balanced parameter. Since bioassay data usually are highly skewed (i.e. only few are hits and most of them are not), for a better training purpose, we can make the data slightly more balanced. \n",
    "2. The ratio of balancing can be tuned by the oversample_num parameter. It states the percentage of unbalanced:balanced data points."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Beginning Processing...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/kh278/.conda/envs/DeepPurpose/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3254: DtypeWarning: Columns (0,7) have mixed types.Specify dtype option on import or set low_memory=False.\n",
      "  if (await self.run_code(code, result,  async_=asy)):\n",
      "/home/kh278/DeepPurpose/dataset.py:311: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  val['binary_label'][(val.PUBCHEM_ACTIVITY_SCORE >= threshold) & (val.PUBCHEM_ACTIVITY_SCORE <=100)] = 1\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Default binary threshold for the binding affinity scores is 15, recommended by the investigator\n",
      "Done!\n"
     ]
    }
   ],
   "source": [
    "X_drugs, X_targets, y = dataset.load_AID1706_SARS_CoV_3CL(path = './data', binary = True, threshold = 15, balanced = True, oversample_num = 30, seed = 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the end, we show how to load customized bioassay training data. We assume the following format:\n",
    "\n",
    "AID1706.txt\n",
    "\n",
    "SGFKKLVSP...GVRLQ \\\n",
    "CCOC1...C=N4 0 \\\n",
    "CCCCO...=CS2 0 \\\n",
    "COC1=...23)F 0 \\\n",
    "C1=CC...)C#N 1 \\\n",
    "CC(=O...3.Cl 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_drugs, X_targets, y = dataset.read_file_training_dataset_bioassay('./toy_data/AID1706.txt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['CCOC1=CC=C(C=C1)N2C=CC(=O)C(=N2)C(=O)NC3=CC=C(C=C3)S(=O)(=O)NC4=NC=CC=N4',\n",
       "       'CCCCOC(=O)C1=CC=C(C=C1)NC(=O)/C=C/C2=CC=CS2'], dtype='<U79')"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_drugs[:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's all for now! Definitely checkout more demos and tutorials for DeepPurpose! Let us know what you like or don't like for us to improve : )\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:DeepPurpose]",
   "language": "python",
   "name": "conda-env-DeepPurpose-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
